This week's lab focuses on data modelling using linear regression. At the end of the lab, you should be able to use scikit-learn to:
- Build a least squares linear regression model with the LinearRegression class.
- Build a dummy (baseline) regression model with the DummyRegressor class.
- Evaluate and compare regression models using cross validation.
- Transform features to improve the fit of a linear regression model.
Let's start by importing the packages we need. This week, we're going to use the linear_model subpackage from scikit-learn to build linear regression models using the least squares technique. We're also going to need the dummy subpackage to create a baseline regression model against which we can compare our results.
In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The data set is in fixed-width format, but fortunately this is supported out of the box by pandas' read_fwf function:
In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model year', 'origin', 'car name'])
According to its documentation, the Auto MPG dataset consists of eight explanatory variables (i.e. features), each describing a single car model, which are related to the given target variable: the fuel efficiency of the car, measured in miles per gallon (MPG). The following attribute information is given:
1. mpg: continuous (the target variable)
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Let's start by taking a quick peek at the data:
In [ ]:
df.head()
As the car name is unique for each instance (according to the dataset documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:
Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names, but for simplicity let's just drop them here (a rough sketch of the idea is included after the next cell).
In [ ]:
df = df.set_index('car name')
df.head()
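As an aside, here is a minimal sketch of the manufacturer idea from the note above. It assumes the manufacturer is simply the first word of each car name, which may not hold for every entry in the raw data, and it isn't used in the rest of the lab:
In [ ]:
# Aside: derive a (hypothetical) manufacturer feature from the car name index
manufacturer = df.index.to_series().str.split().str[0]
manufacturer.head()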
According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. If we want to check how many such rows there are, a quick count like the following should do:
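In [ ]:
# Count the rows where horsepower is recorded as '?'
(df['horsepower'] == '?').sum()
Again, for simplicity, let's just drop these rows from the data set: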
In [ ]:
df = df[df['horsepower'] != '?']
Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because there were strings present initially, the data type of the horsepower column isn't numeric:
In [ ]:
df.dtypes
We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:
In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])
# Check the data types again
df.dtypes
As can be seen, the data type of the horsepower column is now float64, i.e. a 64-bit floating point value.
According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2) and so we should encode it via one-hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is use the get_dummies function, as follows:
In [ ]:
df = pd.get_dummies(df, columns=['origin'], drop_first=True)
df.head()
As can be seen, one-hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a linear regression model, we must also drop the first of the encoded columns, by setting the drop_first keyword argument to True, to avoid introducing multicollinearity.
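To see why, note that without drop_first the dummy columns for a single categorical variable always sum to one in every row, which makes them perfectly collinear with the intercept term. A quick illustration using a small made-up origin series (not our data):
In [ ]:
# Illustration only: without drop_first, the dummy columns sum to 1 in every row,
# making them perfectly collinear with an all-ones intercept column
demo = pd.get_dummies(pd.Series([1, 2, 3, 1, 2], name='origin'), prefix='origin')
demo['row sum'] = demo.sum(axis='columns')
demo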
Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:
In [ ]:
df.describe()
Next, let's print a matrix of pairwise Pearson correlation values:
In [ ]:
df.corr()
Let's also create a scatter plot matrix:
In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));
Based on the above information, we can conclude the following:
- Several of the features (e.g. displacement, horsepower and weight) appear to have non-linear relationships with the target variable.
- Some of the predictors are highly correlated with one another.
For now, we'll just note this information, but we'll come back to it later when improving our model.
Let's start our analysis by building a dummy regression model that makes very naive (often incorrect) predictions about the target variable. This is a good first step as it gives us a benchmark to compare our later models to.
Creating a dummy regression model with scikit-learn is easy: first, we create an instance of DummyRegressor, and then we evaluate its performance on the data using cross validation, just like last week.
Note: Our dummy model has no hyperparameters, so we don't need to do an inner cross validation or grid search - just the outer cross validation to estimate the model accuracy.
In [ ]:
X = df.drop('mpg', axis='columns') # X = features
y = df['mpg'] # y = prediction target
model = DummyRegressor() # Predicts the target as the mean of the training target values
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0) # 5 fold cross validation
y_pred = cross_val_predict(model, X, y, cv=outer_cv) # Make predictions via cross validation
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
title='Distribution of errors for the dummy model',
xlabel='Error'
);
The dummy model predicts the MPG with an average error of approximately $\pm 6.57$ (although, as can be seen from the distribution of errors the spread is much larger than this). Let's see if we can do better with a linear regression model.
scikit-learn supports linear regression via its linear_model subpackage. This subpackage supports least squares regression, lasso regression and ridge regression, as well as many other varieties. Let's use least squares to build our model. We can do this using the LinearRegression class, which supports the following options:
- fit_intercept: If True, prepend an all-ones predictor to the feature matrix before fitting the regression model; otherwise, use the feature matrix as is. By default, this is True if not specified.
- normalize: If True, standardize the input features before fitting the regression model; otherwise, use the unscaled features. By default, this is False if not specified.
Generally, it makes sense to fit an intercept term when building regression models, the exception being cases where it is known that the target variable is zero when all the feature values are zero. In our case, it seems unlikely that an all-zero feature vector would correspond to a zero MPG target value (for instance, consider the meaning of model year = 0 and weight = 0 in the context of the analysis). Consequently, we can set fit_intercept=True below.
Whether to standardize the input features or not depends on a number of factors, including whether we intend to generate supplemental features from the existing ones and whether the features are approximately normally distributed. In our case, as we are not generating supplemental new features and several of the features are not normally distributed (see the scatter plot matrix above), we can choose not to standardize them (normalize=False) at no disadvantage.
Note: In cases where there is uncertainty as to whether an intercept should be fit or not, or whether the input features should be standardized or not, or both, we can use a grid search with nested cross validation (i.e. model selection) to determine the correct answer.
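As a rough, optional sketch of how that might look here, we could wrap LinearRegression in a grid search over fit_intercept and nest it inside the outer cross validation (the grid shown is illustrative only; normalize could be added to it in the same way on scikit-learn versions that still support that argument):
In [ ]:
# A minimal sketch of nested cross validation to choose fit_intercept (illustrative only)
X = df.drop('mpg', axis='columns')
y = df['mpg']
param_grid = {'fit_intercept': [True, False]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
gs = GridSearchCV(LinearRegression(), param_grid, cv=inner_cv, scoring='neg_mean_absolute_error')
y_pred = cross_val_predict(gs, X, y, cv=outer_cv)
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))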
In [ ]:
X = df.drop('mpg', axis='columns') # X = features
y = df['mpg'] # y = prediction target
model = LinearRegression(fit_intercept=True, normalize=False) # Use least squares linear regression
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0) # 5-fold cross validation
y_pred = cross_val_predict(model, X, y, cv=outer_cv) # Make predictions via cross validation
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
title='Distribution of errors for the linear regression model',
xlabel='Error'
);
Our linear regression model predicts the MPG with an average error of approximately $\pm 2.59$ and a significantly smaller standard deviation too - this is a big improvement over the dummy model!
But we can do better! Earlier, we noted that several of the features had non-linear relationships with the target variable - if we could transform these variables, we might be able to make this relationship more linear. Let's consider the displacement, horsepower and weight variables:
In [ ]:
pd.plotting.scatter_matrix(df[['mpg', 'displacement', 'horsepower', 'weight']], s=50, figsize=(9, 9));
The relationship between the target and these predictors appears to be an exponentially decreasing one: as the predictors increase in value, there is an exponential decrease in the target value. Log-transforming the variables should help to remove this effect (logarithms are the inverse mathematical operation to exponentials):
In [ ]:
# Log-transform the skewed predictors to linearise their relationship with the target
df['displacement'] = df['displacement'].map(np.log)
df['horsepower'] = df['horsepower'].map(np.log)
df['weight'] = df['weight'].map(np.log)
Now, the relationship between the predictors and the target is much more linear:
In [ ]:
pd.plotting.scatter_matrix(df[['mpg', 'displacement', 'horsepower', 'weight']], s=50, figsize=(9, 9));
Let's run the analysis a second time and see the effect this has had:
In [ ]:
X = df.drop('mpg', axis='columns')
y = df['mpg']
model = LinearRegression(fit_intercept=True, normalize=False)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=outer_cv)
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
title='Distribution of errors for the linear regression model with transformed features',
xlabel='Error'
);
As can be seen, the average error has now decreased to $\pm 2.33$ and the standard deviation of the error to 3.12. Further reductions in error might be achieved by experimenting with feature selection, given the high degree of correlation between some of the predictors, or with a more sophisticated model, such as ridge regression.
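As a rough illustration of the latter idea, a ridge regression whose regularisation strength (alpha) is chosen by nested cross validation might look something like the sketch below; the alpha grid is just an example, not a recommendation:
In [ ]:
from sklearn.linear_model import Ridge

# A minimal sketch of ridge regression with nested cross validation; the alpha values are illustrative only
X = df.drop('mpg', axis='columns')
y = df['mpg']
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
gs = GridSearchCV(Ridge(), param_grid, cv=inner_cv, scoring='neg_mean_absolute_error')
y_pred = cross_val_predict(gs, X, y, cv=outer_cv)
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))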
Once we have identified an approach that satisfies our requirements (e.g. accuracy), we should build a final model using all of the data.
In [ ]:
X = df.drop('mpg', axis='columns')
y = df['mpg']
model = LinearRegression(fit_intercept=True, normalize=False)
model.fit(X, y) # Fit the model using all of the data
We can examine the values of the intercept (if we chose to fit one) and the coefficients of our final model by printing its intercept_ and coef_ attributes, as follows:
In [ ]:
print(model.intercept_)
print(model.coef_) # Coefficients are printed in the same order as the columns in the feature matrix, X
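To make the coefficients easier to read, we can also pair each one with the name of the corresponding feature column, for example:
In [ ]:
# Label each coefficient with its feature name for readability
pd.Series(model.coef_, index=X.columns)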